# Colab-related: upgrade plotting and ML libraries
!pip install matplotlib --upgrade
!pip install scikit-learn --upgrade
# Importing main libraries
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, ParameterGrid
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier
from sklearn import metrics
from sklearn.metrics import confusion_matrix, roc_auc_score, RocCurveDisplay
plt.rcParams.update({'figure.figsize': (10.0, 10.0)})
plt.rcParams.update({'font.size': 12})
#plt.rcParams.update({'figure.dpi': 300})
data_source = "https://raw.githubusercontent.com/Giovo17/cardio-disease-analysis/main/cardio_train.csv"
df = pd.read_csv(data_source, sep=";", index_col="id")
df = df.rename(columns={"ap_hi": "systolic_bp", "ap_lo": "diastolic_bp",
"gluc": "glucose", "alco": "alcool_intake",
"active": "physical_activity", "cardio": "cardio_disease"})
df.head()
| id | age | gender | height | weight | systolic_bp | diastolic_bp | cholesterol | glucose | smoke | alcool_intake | physical_activity | cardio_disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |
There are 3 types of input features:
| Feature | Feature type | Name in dataset | Data type |
|---|---|---|---|
| Age | Objective Feature | age | int (days) |
| Gender | Objective Feature | gender | categorical code (1: female, 2: male) |
| Height | Objective Feature | height | int (cm) |
| Weight | Objective Feature | weight | float (kg) |
| Systolic blood pressure | Examination Feature | systolic_bp | int |
| Diastolic blood pressure | Examination Feature | diastolic_bp | int |
| Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
| Glucose | Examination Feature | glucose | 1: normal, 2: above normal, 3: well above normal |
| Smoking | Subjective Feature | smoke | binary |
| Alcohol intake | Subjective Feature | alcool_intake | binary |
| Physical activity | Subjective Feature | physical_activity | binary |
| Presence or absence of cardiovascular disease | Target Variable | cardio_disease | binary |
All of the dataset values were collected at the moment of medical examination.
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 70000 entries, 0 to 99999
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   age                70000 non-null  int64
 1   gender             70000 non-null  int64
 2   height             70000 non-null  int64
 3   weight             70000 non-null  float64
 4   systolic_bp        70000 non-null  int64
 5   diastolic_bp       70000 non-null  int64
 6   cholesterol        70000 non-null  int64
 7   glucose            70000 non-null  int64
 8   smoke              70000 non-null  int64
 9   alcool_intake      70000 non-null  int64
 10  physical_activity  70000 non-null  int64
 11  cardio_disease     70000 non-null  int64
dtypes: float64(1), int64(11)
memory usage: 6.9 MB
There are no missing values
df["gender"] = df["gender"].map({1: 0, 2: 1})
Add the BMI feature from height and weight
BMI = df["weight"] / (df["height"] / 100)**2
df.insert(4, "BMI", BMI)
Convert age to years
df["age"] = (df["age"]/365).astype(int)
df.head()
| id | age | gender | height | weight | BMI | systolic_bp | diastolic_bp | cholesterol | glucose | smoke | alcool_intake | physical_activity | cardio_disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | 1 | 168 | 62.0 | 21.967120 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 55 | 0 | 156 | 85.0 | 34.927679 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 51 | 0 | 165 | 64.0 | 23.507805 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 48 | 1 | 169 | 82.0 | 28.710479 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 47 | 0 | 156 | 56.0 | 23.011177 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |
Search for duplicated rows
print("Duplicate rows: {}".format(df.duplicated().sum()))
df = df.drop_duplicates()
Duplicate rows: 3208
df.shape
(66792, 13)
After dropping the duplicated rows, 66792 observations remain.
def map_values(dataframe, to_numeric=False):
if to_numeric:
dataframe["gender"] = dataframe["gender"].map({"female": 0, "male": 1})
dataframe["cholesterol"] = dataframe["cholesterol"].map({"normal": 1, "above normal": 2, "well above normal": 3})
dataframe["glucose"] = dataframe["glucose"].map({"normal": 1, "above normal": 2, "well above normal": 3})
dataframe["smoke"] = dataframe["smoke"].map({"no": 0, "yes": 1})
dataframe["alcool_intake"] = dataframe["alcool_intake"].map({"no": 0, "yes": 1})
dataframe["physical_activity"] = dataframe["physical_activity"].map({"inactive": 0, "active": 1})
dataframe["cardio_disease"] = dataframe["cardio_disease"].map({"healthy": 0, "sick": 1})
else:
dataframe["gender"] = dataframe["gender"].map({0: "female", 1: "male"})
dataframe["cholesterol"] = dataframe["cholesterol"].map({1: "normal", 2: "above normal", 3: "well above normal"})
dataframe["glucose"] = dataframe["glucose"].map({1: "normal", 2: "above normal", 3: "well above normal"})
dataframe["smoke"] = dataframe["smoke"].map({0: "no", 1: "yes"})
dataframe["alcool_intake"] = dataframe["alcool_intake"].map({0: "no", 1: "yes"})
dataframe["physical_activity"] = dataframe["physical_activity"].map({0: "inactive", 1: "active"})
dataframe["cardio_disease"] = dataframe["cardio_disease"].map({0: "healthy", 1: "sick"})
return dataframe
df = map_values(df, to_numeric=False)
def get_percentages(ax_container):
    # Build "count (percentage %)" labels for each bar in the container
    total = sum(bar.get_height() for bar in ax_container)
    return [str(bar.get_height()) + " (" + str(round(bar.get_height() / total * 100, 1)) + " %)"
            for bar in ax_container]
ax = sns.countplot(data=df, x="cardio_disease")
ax.bar_label(ax.containers[0], get_percentages(ax.containers[0]))
ax.set_title("Cardio disease countplot")
plt.show()
The target variable cardio_disease is balanced.
ax = sns.countplot(data=df, x="gender", hue="cardio_disease")
ax.bar_label(ax.containers[0], get_percentages(ax.containers[0]))
ax.bar_label(ax.containers[1], get_percentages(ax.containers[1]))
ax.set_title("Gender countplot - cardio hue")
plt.show()
The gender of the patient doesn't seem to have a noticeable correlation with the target variable.
ax = sns.countplot(data=df, x="cholesterol", hue="cardio_disease", order=["normal", "above normal", "well above normal"])
ax.bar_label(ax.containers[0], get_percentages(ax.containers[0]))
ax.bar_label(ax.containers[1], get_percentages(ax.containers[1]))
ax.set_title("Cholesterol countplot - cardio hue")
plt.show()
As the graph shows, cholesterol has a clear impact on the target variable.
ax = sns.countplot(data=df, x="glucose", hue="cardio_disease", order=["normal", "above normal", "well above normal"])
ax.bar_label(ax.containers[0], get_percentages(ax.containers[0]))
ax.bar_label(ax.containers[1], get_percentages(ax.containers[1]))
ax.set_title("Glucose countplot - cardio hue")
plt.show()
The glucose variable follows the same trend as cholesterol, but the differences between healthy and sick patients are more subtle.
ax = sns.countplot(data=df, x="smoke", hue="cardio_disease")
ax.bar_label(ax.containers[0], get_percentages(ax.containers[0]))
ax.bar_label(ax.containers[1], get_percentages(ax.containers[1]))
ax.set_title("Smoke countplot - cardio hue")
plt.show()
The smoke feature appears uncorrelated with the patient's health.
ax = sns.countplot(data=df, x="alcool_intake", hue="cardio_disease")
ax.bar_label(ax.containers[0], get_percentages(ax.containers[0]))
ax.bar_label(ax.containers[1], get_percentages(ax.containers[1]))
ax.set_title("Alcool intake countplot - cardio hue")
plt.show()
ax = sns.countplot(data=df, x="physical_activity", hue="cardio_disease")
ax.bar_label(ax.containers[0], get_percentages(ax.containers[0]))
ax.bar_label(ax.containers[1], get_percentages(ax.containers[1]))
ax.set_title("Physical activity countplot - cardio hue")
plt.show()
Alcohol intake and physical activity behave like the smoke feature: no apparent relationship with the target.
Age
ax = sns.boxplot(data=df, x="age", y="cardio_disease", orient="h")
ax.set_title("Age boxplot - cardio hue")
plt.show()
df.groupby("cardio_disease")["age"].describe()
| cardio_disease | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| healthy | 32599.0 | 51.220191 | 6.847514 | 29.0 | 46.0 | 52.0 | 57.0 | 64.0 |
| sick | 34193.0 | 54.422835 | 6.380788 | 39.0 | 50.0 | 55.0 | 60.0 | 64.0 |
plt.subplot(2, 1, 1)
ax = sns.histplot(data=df, x="age", bins=20)
ax.set_title("Age histogram")
plt.subplot(2, 1, 2)
ax = sns.histplot(data=df, x="age", bins=20, hue="cardio_disease")
ax.set_title("Age histogram - cardio hue")
plt.subplots_adjust(top=1)
plt.show()
The patients' ages range from 29 to 64, so the dataset contains only adults.
The distribution appears bimodal, with modes around 55 and 58.
The conditional boxplot suggests this feature is mildly related to the target variable: the box for patients affected by cardiovascular disease has a higher minimum, 1st quartile, median, and 3rd quartile than the one for healthy patients. The conditional histogram confirms this trend: the concentration of unhealthy patients grows with age. In both plots, however, there is considerable overlap between the two classes.
Height, weight and BMI
BMI is the Body Mass Index, defined as $ BMI = \frac{w}{h^2} $, where $w$ is the weight in kilograms and $h$ is the height in meters.
A reference table from the Ministero della Salute (Italian Ministry of Health):
| Condition | BMI |
|---|---|
| Severe thinness | BMI < 16 |
| Underweight | 16 < BMI < 18.49 |
| Normal weight | 18.5 < BMI < 24.99 |
| Overweight | 25 < BMI < 29.99 |
| Obese class 1 | 30 < BMI < 34.99 |
| Obese class 2 | 35 < BMI < 39.99 |
| Obese class 3 | BMI > 40 |
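The reference table above can be expressed as a small helper (a sketch; the bin edges follow the table, and `bmi_category` is not used elsewhere in this notebook):

```python
def bmi_category(bmi):
    """Map a BMI value (kg/m^2) to the condition in the reference table."""
    if bmi < 16:
        return "Severe thinness"
    elif bmi < 18.5:
        return "Underweight"
    elif bmi < 25:
        return "Normal weight"
    elif bmi < 30:
        return "Overweight"
    elif bmi < 35:
        return "Obese class 1"
    elif bmi < 40:
        return "Obese class 2"
    return "Obese class 3"

print(bmi_category(21.97))  # the BMI of the first patient in df.head()
```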
plt.subplot(2, 1, 1)
ax = sns.histplot(data=df, x="height", bins=50)
ax.set_xlabel("height ($cm$)")
ax.set_title("Height histogram")
plt.subplot(2, 1, 2)
ax = sns.histplot(data=df, x="weight", bins=50)
ax.set_xlabel("weight ($kg$)")
ax.set_title("Weight histogram")
plt.subplots_adjust(top=1)
plt.show()
plt.subplot(2, 1, 1)
ax = sns.boxplot(data=df, x="height")
ax.set_xlabel("height ($cm$)")
ax.set_title("Height boxplot")
plt.subplot(2, 1, 2)
ax = sns.boxplot(data=df, x="weight")
ax.set_xlabel("weight ($kg$)")
ax.set_title("Weight boxplot")
plt.subplots_adjust(top=1)
plt.show()
pd.DataFrame(df["height"].describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.999])).T
| | count | mean | std | min | 1% | 25% | 50% | 75% | 99.9% | max |
|---|---|---|---|---|---|---|---|---|---|---|
| height | 66792.0 | 164.341748 | 8.333904 | 55.0 | 146.0 | 159.0 | 165.0 | 170.0 | 190.0 | 250.0 |
pd.DataFrame(df["weight"].describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.999])).T
| | count | mean | std | min | 1% | 25% | 50% | 75% | 99.9% | max |
|---|---|---|---|---|---|---|---|---|---|---|
| weight | 66792.0 | 74.52116 | 14.580675 | 10.0 | 48.0 | 65.0 | 72.0 | 83.0 | 150.0 | 200.0 |
Both height and weight have unimodal distributions, with modes around 165 cm and 65 kg respectively.
The height distribution does not look skewed, while the weight distribution appears slightly positively skewed, as its right tail is a bit longer than the left one.
Both features present outliers, as the boxplots and the corresponding statistics tables show.
Checking weight skewness:
print("Mode: {}, median: {}, mean: {}".format(stats.mode(df["weight"], keepdims=True)[0][0], np.median(df["weight"]), round(np.mean(df["weight"]), 2)))
print("Fisher-Pearson coefficient of skewness: {}".format(round(stats.skew(df["weight"]), 2)))
Mode: 70.0, median: 72.0, mean: 74.52
Fisher-Pearson coefficient of skewness: 0.97
ax = sns.boxplot(data=df, x="BMI")
ax.set_xlabel("BMI ($kg/m^2$)")
ax.set_title("BMI boxplot")
plt.show()
pd.DataFrame(df["BMI"].describe(percentiles=[0.25, 0.5, 0.75, 0.999])).T
| | count | mean | std | min | 25% | 50% | 75% | 99.9% | max |
|---|---|---|---|---|---|---|---|---|---|
| BMI | 66792.0 | 27.682565 | 6.184422 | 3.471784 | 23.875115 | 26.573129 | 30.46875 | 59.623333 | 298.666667 |
plt.subplot(2, 1, 1)
ax = sns.boxplot(data=df, x="BMI", y="cardio_disease", orient="h")
ax.set_xlabel("BMI ($kg/m^2$)")
ax.set_title("BMI boxplot - cardio hue")
plt.subplot(2, 1, 2)
ax = sns.boxplot(data=df, x="BMI", y="cardio_disease", orient="h")
ax.set_xlabel("BMI ($kg/m^2$)")
ax.set_xlim(13, 45)
plt.show()
df.groupby("cardio_disease")["BMI"].describe(percentiles=[0.25, 0.5, 0.75, 0.999])
| cardio_disease | count | mean | std | min | 25% | 50% | 75% | 99.9% | max |
|---|---|---|---|---|---|---|---|---|---|
| healthy | 32599.0 | 26.674952 | 5.755386 | 7.022248 | 23.372576 | 25.636917 | 29.060607 | 60.089236 | 237.768633 |
| sick | 34193.0 | 28.643205 | 6.421927 | 3.471784 | 24.560326 | 27.548209 | 31.615793 | 59.458581 | 298.666667 |
plt.subplot(2, 1, 1)
ax = sns.histplot(data=df, x="BMI", bins=100, kde=True)
ax.set_xlabel("BMI ($kg/m^2$)")
ax.set_title("BMI histogram")
plt.subplot(2, 1, 2)
ax = sns.histplot(data=df, x="BMI", bins=100, hue="cardio_disease")
ax.set_xlabel("BMI ($kg/m^2$)")
ax.set_title("BMI histogram - cardio hue")
plt.subplots_adjust(top=1)
plt.show()
ax = sns.histplot(data=df, x="BMI", bins=100, hue="cardio_disease")
ax.set_xlim(0, 70)
ax.set_xlabel("BMI ($kg/m^2$)")
ax.set_title("BMI histogram - cardio hue")
plt.show()
print("Mode: {}, median: {}, mean: {}".format(round(stats.mode(df["BMI"], keepdims=True)[0][0], 2), round(np.median(df["BMI"]), 2), round(np.mean(df["BMI"]), 2)))
print("Fisher-Pearson coefficient of skewness: {}".format(round(stats.skew(df["BMI"]), 2)))
Mode: 23.88, median: 26.57, mean: 27.68
Fisher-Pearson coefficient of skewness: 7.69
The BMI feature is unimodal (mode at 23.88) and positively skewed, as shown by the Fisher-Pearson coefficient.
This feature contains many outliers.
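The Fisher-Pearson coefficient is the standardized third central moment, $\mathbb{E}[(x-\mu)^3]/\sigma^3$. A quick sanity check against `scipy.stats.skew` (a sketch on synthetic, BMI-like data; `scipy`'s default is the same biased estimator):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.lognormal(mean=3.3, sigma=0.2, size=10_000)  # positively skewed sample

# Fisher-Pearson skewness computed from the definition
manual = np.mean((x - x.mean()) ** 3) / x.std() ** 3
print(round(manual, 4), round(stats.skew(x), 4))  # the two values agree
```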
Systolic blood pressure and diastolic blood pressure
These features measure the arterial pressure when the heart beats (systolic) and in the period between two heartbeats (diastolic).
A reference table from heart.org:
| Blood pressure category | Systolic blood pressure (mm Hg) | and/or | Diastolic blood pressure (mm Hg) |
|---|---|---|---|
| Normal | systolic_bp < 120 | and | diastolic_bp < 80 |
| Elevated | 120 < systolic_bp < 129 | and | diastolic_bp < 80 |
| High blood pressure (Hypertension stage 1) | 130 < systolic_bp < 139 | or | 80 < diastolic_bp < 89 |
| High blood pressure (Hypertension stage 2) | systolic_bp > 140 | or | diastolic_bp > 90 |
| Hypertensive crisis | systolic_bp > 180 | and/or | diastolic_bp > 120 |
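The table can be turned into a classification helper (a sketch; the checks run from most to least severe, so boundary values resolve to the more severe category, and `bp_category` is not part of the original analysis):

```python
def bp_category(systolic, diastolic):
    """Classify blood pressure (mm Hg) following the heart.org table."""
    if systolic > 180 or diastolic > 120:
        return "Hypertensive crisis"
    if systolic >= 140 or diastolic >= 90:
        return "Hypertension stage 2"
    if systolic >= 130 or diastolic >= 80:
        return "Hypertension stage 1"
    if systolic >= 120:
        return "Elevated"
    return "Normal"

print(bp_category(110, 70))
```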
ax = sns.boxplot(data=df, x="systolic_bp")
ax.set_xlabel("systolic_bp ($mm Hg$)")
ax.set_title("Systolic blood pressure boxplot")
plt.show()
pd.DataFrame(df["systolic_bp"].describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.999])).T
| | count | mean | std | min | 1% | 25% | 50% | 75% | 99.9% | max |
|---|---|---|---|---|---|---|---|---|---|---|
| systolic_bp | 66792.0 | 129.231585 | 157.649354 | -150.0 | 90.0 | 120.0 | 120.0 | 140.0 | 220.0 | 16020.0 |
ax = sns.boxplot(data=df, x="systolic_bp", y="cardio_disease", orient="h")
ax.set_xlim(80, 180)
ax.set_xlabel("systolic_bp ($mm Hg$)")
ax.set_title("Systolic blood pressure boxplot - cardio hue")
plt.show()
df.groupby("cardio_disease")["systolic_bp"].describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.99])
| cardio_disease | count | mean | std | min | 1% | 25% | 50% | 75% | 99% | max |
|---|---|---|---|---|---|---|---|---|---|---|
| healthy | 32599.0 | 120.528728 | 107.320611 | -120.0 | 90.0 | 110.0 | 120.0 | 120.0 | 160.0 | 14020.0 |
| sick | 34193.0 | 137.528734 | 193.460336 | -150.0 | 100.0 | 120.0 | 130.0 | 140.0 | 180.0 | 16020.0 |
plt.subplot(2, 1, 1)
ax = sns.histplot(data=df, x="systolic_bp", bins=1700)
ax.set_xlim(-50, 300)
ax.set_xlabel("systolic_bp ($mm Hg$)")
ax.set_title("Systolic blood pressure histogram")
plt.subplot(2, 1, 2)
ax = sns.histplot(data=df, x="systolic_bp", bins=1700, hue="cardio_disease")
ax.set_xlim(-50, 300)
ax.set_xlabel("systolic_bp ($mm Hg$)")
ax.set_title("Systolic blood pressure histogram - cardio hue")
plt.subplots_adjust(top=1)
plt.show()
ax = sns.boxplot(data=df, x="diastolic_bp")
ax.set_xlabel("diastolic_bp ($mm Hg$)")
ax.set_title("Diastolic blood pressure boxplot")
plt.show()
pd.DataFrame(df["diastolic_bp"].describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.999])).T
| | count | mean | std | min | 1% | 25% | 50% | 75% | 99.9% | max |
|---|---|---|---|---|---|---|---|---|---|---|
| diastolic_bp | 66792.0 | 97.446221 | 192.906434 | -70.0 | 60.0 | 80.0 | 80.0 | 90.0 | 1110.0 | 11000.0 |
ax = sns.boxplot(data=df, x="diastolic_bp", y="cardio_disease", orient="h")
ax.set_xlim(50, 120)
ax.set_xlabel("diastolic_bp ($mm Hg$)")
ax.set_title("Diastolic blood pressure boxplot - cardio hue")
plt.show()
df.groupby("cardio_disease")["diastolic_bp"].describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.99])
| cardio_disease | count | mean | std | min | 1% | 25% | 50% | 75% | 99% | max |
|---|---|---|---|---|---|---|---|---|---|---|
| healthy | 32599.0 | 84.634743 | 158.248448 | 0.0 | 60.0 | 70.0 | 80.0 | 80.0 | 100.0 | 9800.0 |
| sick | 34193.0 | 109.660457 | 220.252704 | -70.0 | 60.0 | 80.0 | 80.0 | 90.0 | 1000.0 | 11000.0 |
plt.subplot(2, 1, 1)
ax = sns.histplot(data=df, x="diastolic_bp", bins=1300)
ax.set_xlim(-50, 250)
ax.set_xlabel("diastolic_bp ($mm Hg$)")
ax.set_title("Diastolic blood pressure")
plt.subplot(2, 1, 2)
ax = sns.histplot(data=df, x="diastolic_bp", bins=1300, hue="cardio_disease")
ax.set_xlim(-50, 250)
ax.set_xlabel("diastolic_bp ($mm Hg$)")
ax.set_title("Diastolic blood pressure - cardio hue")
plt.subplots_adjust(top=1)
plt.show()
These features behave similarly. Both present outliers, as the boxplots show.
Leaving outliers aside, both are unimodal (modes around 120 mmHg and 80 mmHg respectively) and approximately symmetric. Both are related to the cardio_disease feature, as the boxplots and histograms show, although there is overlap between healthy and unhealthy patients.
sns.pairplot(data=df[["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp", "cardio_disease"]], hue='cardio_disease')
plt.show()
ax = sns.heatmap(df[["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp"]].corr(), annot=True, fmt=".2g", linewidths=.1, center=0)
ax.set_title("Numeric variables correlation matrix")
plt.show()
Analyzing the relationship between systolic_bp and diastolic_bp
ax = sns.scatterplot(data=df, x="systolic_bp", y="diastolic_bp")
ax.set_xlim(50, 250)
ax.set_ylim(20, 200)
ax.set_xlabel("systolic_bp ($mm Hg$)")
ax.set_ylabel("diastolic_bp ($mm Hg$)")
ax.set_title("Systolic vs diastolic blood pressure scatterplot")
plt.show()
# removing systolic_bp and diastolic_bp outliers
df_cleaned = df.copy()
df_cleaned = df_cleaned[(np.abs(stats.zscore(df_cleaned["systolic_bp"])) < 1.5)]
df_cleaned = df_cleaned[df_cleaned.systolic_bp > 0]
df_cleaned = df_cleaned[(np.abs(stats.zscore(df_cleaned["diastolic_bp"])) < 1.5)]
df_cleaned = df_cleaned[df_cleaned.diastolic_bp > 0]
print("Original dataset: {}".format(round(np.corrcoef(df["systolic_bp"], df["diastolic_bp"])[0][1], 3)))
print("Dataset without outliers: {}".format(round(np.corrcoef(df_cleaned["systolic_bp"], df_cleaned["diastolic_bp"])[0][1], 3)))
display(df.shape)
display(df_cleaned.shape)
Original dataset: 0.016
Dataset without outliers: 0.649
(66792, 13)
(65777, 13)
As the scatterplot shows, the two blood pressure variables are strongly correlated, but this correlation is hidden in the original data by the presence of outliers.
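The effect is easy to reproduce on synthetic data: a single extreme entry error is enough to collapse the Pearson coefficient of two otherwise strongly correlated variables (a sketch with made-up numbers, independent of df):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(120, 15, size=5_000)           # systolic-like values
y = 0.65 * x + rng.normal(0, 8, size=5_000)   # correlated diastolic-like values
print(round(np.corrcoef(x, y)[0, 1], 3))      # strong positive correlation

# Add one huge-leverage outlier, like the 16020 mm Hg reading in the data
x_out, y_out = np.append(x, 16_000), np.append(y, 60)
print(round(np.corrcoef(x_out, y_out)[0, 1], 3))  # correlation collapses
```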
Analyzing the relationship between BMI, weight, and height
# removing height and weight outliers
display(df_cleaned.shape)
df_cleaned = df_cleaned[(np.abs(stats.zscore(df_cleaned["height"])) < 4)]
df_cleaned = df_cleaned[(np.abs(stats.zscore(df_cleaned["weight"])) < 4)]
display(df_cleaned.shape)
(65777, 13)
(65493, 13)
plt.subplot(2, 2, 1)
ax = sns.scatterplot(data=df, x="weight", y="BMI")
ax.set_xlabel("weight ($kg$)")
ax.set_ylabel("BMI ($kg/m^2$)")
ax.set_title("Weight vs BMI scatterplot")
plt.subplot(2, 2, 2)
ax = sns.scatterplot(data=df_cleaned, x="weight", y="BMI")
ax.set_xlabel("weight ($kg$)")
ax.set_ylabel("BMI ($kg/m^2$)")
ax.set_title("Weight vs BMI scatterplot - cleaned dataset")
plt.subplot(2, 2, 3)
ax = sns.scatterplot(data=df, x="height", y="BMI")
ax.set_xlabel("height ($cm$)")
ax.set_ylabel("BMI ($kg/m^2$)")
ax.set_title("Height vs BMI scatterplot")
plt.subplot(2, 2, 4)
ax = sns.scatterplot(data=df_cleaned, x="height", y="BMI")
ax.set_xlabel("height ($cm$)")
ax.set_ylabel("BMI ($kg/m^2$)")
ax.set_title("Height vs BMI scatterplot - cleaned dataset")
plt.subplots_adjust(top=1)
plt.show()
sns.pairplot(data=df_cleaned[["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp", "cardio_disease"]], hue='cardio_disease')
plt.show()
ax = sns.heatmap(df_cleaned[["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp"]].corr(), annot=True, fmt=".2g", linewidths=.1, center=0)
ax.set_title("Numeric variables correlation matrix - cleaned dataset")
plt.show()
df = map_values(df, to_numeric=True)
df_cleaned = map_values(df_cleaned, to_numeric=True)
PCA
df.head()
| id | age | gender | height | weight | BMI | systolic_bp | diastolic_bp | cholesterol | glucose | smoke | alcool_intake | physical_activity | cardio_disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | 1 | 168 | 62.0 | 21.967120 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 55 | 0 | 156 | 85.0 | 34.927679 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 51 | 0 | 165 | 64.0 | 23.507805 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 48 | 1 | 169 | 82.0 | 28.710479 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 47 | 0 | 156 | 56.0 | 23.011177 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |
scaler = StandardScaler()
scaler.fit(df_cleaned[["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp"]])
df_1 = scaler.transform(df_cleaned[["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp"]])
pca = PCA()
df_1_reduced = pca.fit_transform(df_1)
def biplot(score, coeff, labels=["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp"]):
xs = score[:,0]
ys = score[:,1]
n = coeff.shape[0]
scalex = 1.0/(xs.max() - xs.min())
scaley = 1.0/(ys.max() - ys.min())
plt.scatter(xs * scalex, ys * scaley) # Display data points
# Diplay arrows and labels
for i in range(n):
plt.arrow(0, 0, coeff[i,0], coeff[i,1], color = 'r', alpha = 0.5)
if labels is None:
plt.text(coeff[i,0] * 1.15, coeff[i,1] * 1.15, "Var" + str(i+1), color = 'g', ha = 'center', va = 'center')
else:
plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')
# Plot settings
plt.xlim(-1,1)
plt.ylim(-1,1)
plt.xlabel("PC{}".format(1))
plt.ylabel("PC{}".format(2))
plt.title("Biplot - 2 components")
plt.grid()
PCs = np.arange(pca.n_components_) + 1
cumulative_explained_variance = np.cumsum(pca.explained_variance_ratio_)
plt.plot(PCs, pca.explained_variance_ratio_, 'o-', linewidth=2, color='blue')
plt.plot(PCs, cumulative_explained_variance, 'o-', linewidth=2, color='red')
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.show()
The elbow of the scree plot occurs at 4 PCs, but I will proceed with 2 and 3 PCs for data visualization's sake.
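As a side note, scikit-learn can also pick the number of components for you: passing a float in (0, 1) as `n_components` keeps the smallest number of PCs whose cumulative explained variance reaches that fraction (a sketch on synthetic correlated data, not on df):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# 6 features built from 3 latent dimensions, mimicking correlated standardized columns
base = rng.normal(size=(1000, 3))
X = np.hstack([base, base + 0.1 * rng.normal(size=(1000, 3))])

pca = PCA(n_components=0.90)  # keep enough PCs for 90% of the variance
X_reduced = pca.fit_transform(X)
print(pca.n_components_, round(pca.explained_variance_ratio_.sum(), 3))
```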
biplot(df_1_reduced[:,0:2], np.transpose(pca.components_[0:2, :]))
plt.show()
def threeD_biplot(score, coeff, labels=["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp"]):
xs = score[:,0]
ys = score[:,1]
zs = score[:,2]
n = coeff.shape[0]
scalex = 1.0/(xs.max() - xs.min())
scaley = 1.0/(ys.max() - ys.min())
scalez = 1.0/(zs.max() - zs.min())
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(xs * scalex, ys * scaley, zs * scalez)
'''
for i in range(n):
plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5)
if labels is None:
plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, coeff[i,2] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center')
else:
plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, coeff[i,2] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')
ax.xlim(-1,1)
ax.ylim(-1,1)
ax.zlim(-1,1)
'''
ax.set_xlabel("PC{}".format(1))
ax.set_ylabel("PC{}".format(2))
ax.set_zlabel("PC{}".format(3))
ax.grid()
ax.set_title("Biplot - 3 components")
threeD_biplot(df_1_reduced[:,0:3], np.transpose(pca.components_[0:3, :]))
plt.show()
for col in df_cleaned.columns:
if col != "cardio_disease":
print("Variance of {}: {}".format(col, np.var(df_cleaned[col])))
Variance of age: 46.318408138141294
Variance of gender: 0.22901034234891376
Variance of height: 62.82067588312376
Variance of weight: 196.39850624082894
Variance of BMI: 26.280408307490895
Variance of systolic_bp: 323.72448902929665
Variance of diastolic_bp: 99.73946236693824
Variance of cholesterol: 0.4741720050100033
Variance of glucose: 0.3378730537927973
Variance of smoke: 0.0833820167178163
Variance of alcool_intake: 0.05286931497493984
Variance of physical_activity: 0.1611724652999777
One-hot encoding the cholesterol and glucose features
encoder = OneHotEncoder()
onehotarray = encoder.fit_transform(df_cleaned[["cholesterol"]]).toarray()
items = [f"cholesterol_{item}" for item in encoder.categories_[0]]
df_cleaned[items] = onehotarray
onehotarray = encoder.fit_transform(df_cleaned[["glucose"]]).toarray()
items = [f"glucose_{item}" for item in encoder.categories_[0]]
df_cleaned[items] = onehotarray
df_cleaned = df_cleaned.drop(columns=["cholesterol", "glucose"])
onehotarray = encoder.fit_transform(df[["cholesterol"]]).toarray()
items = [f"cholesterol_{item}" for item in encoder.categories_[0]]
df[items] = onehotarray
onehotarray = encoder.fit_transform(df[["glucose"]]).toarray()
items = [f"glucose_{item}" for item in encoder.categories_[0]]
df[items] = onehotarray
df = df.drop(columns=["cholesterol", "glucose"])
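The same encoding can be obtained more compactly with `pandas.get_dummies` (a sketch on a made-up `toy` frame; note that, unlike `OneHotEncoder`, `get_dummies` only creates columns for the categories actually present in the data):

```python
import pandas as pd

toy = pd.DataFrame({"cholesterol": [1, 3, 3, 1], "glucose": [1, 1, 2, 1]})
# Replaces each listed column with one indicator column per observed category
encoded = pd.get_dummies(toy, columns=["cholesterol", "glucose"])
print(encoded.columns.tolist())  # category 2 of cholesterol never occurs, so no cholesterol_2
```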
Scaling datasets
scaler = StandardScaler()
scaler.fit(df[["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp"]])
df[["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp"]] = scaler.transform(df[["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp"]])
scaler.fit(df_cleaned[["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp"]])
df_cleaned[["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp"]] = scaler.transform(df_cleaned[["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp"]])
df_cleaned.head()
| age | gender | height | weight | BMI | systolic_bp | diastolic_bp | smoke | alcool_intake | physical_activity | cardio_disease | cholesterol_1 | cholesterol_2 | cholesterol_3 | glucose_1 | glucose_2 | glucose_3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | |||||||||||||||||
| 0 | -0.418592 | 1 | 0.452729 | -0.873667 | -1.079791 | -0.922284 | -0.141927 | 0 | 0 | 1 | 0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.316079 | 0 | -1.061285 | 0.767523 | 1.448387 | 0.745092 | 0.859378 | 0 | 0 | 1 | 1 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 2 | -0.271658 | 0 | 0.074225 | -0.730955 | -0.779254 | 0.189300 | -1.143232 | 0 | 0 | 0 | 1 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 3 | -0.712461 | 1 | 0.578897 | 0.553454 | 0.235616 | 1.300884 | 1.860684 | 0 | 0 | 1 | 1 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | -0.859395 | 0 | -1.061285 | -1.301803 | -0.876130 | -1.478076 | -2.144537 | 0 | 0 | 0 | 0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
# Splitting the dataset into a train set and a test set
def tt_split(dataframe):
x = dataframe.loc[:, dataframe.columns!='cardio_disease']
y = dataframe['cardio_disease']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=1)
return X_train, X_test, y_train, y_test
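Since the target here is close to balanced, a plain random split is fine. With a skewed target, passing `stratify=y` to `train_test_split` preserves the class proportions in both splits (a sketch on synthetic labels, not on df):

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 90 + [1] * 10)   # a 90/10 imbalanced target
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20,
                                          random_state=1, stratify=y)
print(y_tr.mean(), y_te.mean())  # both splits keep ~10% positives
```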
classifiers = {
"Decision Tree": (DecisionTreeClassifier(), {"predict_proba": True}, {'criterion': ("gini", "entropy"),
'splitter': ("best", "random"),
'class_weight': ["balanced"],
'random_state': [1] }),
"Random Forest": (RandomForestClassifier(), {"predict_proba": True}, {'n_estimators': [100],
'criterion': ("gini", "entropy"),
'class_weight': ["balanced"],
'max_features': ("sqrt", "log2"),
'random_state': [1] }),
"XGBClassifier": (XGBClassifier(), {"predict_proba": True}, {'n_estimators': [100],
'learning_rate': (0.01, 0.05, 0.10, 0.20, 0.30),
'tree_method': ("exact", "approx", "hist"),
'random_state': [1] }),
"Nearest Neighbors": (KNeighborsClassifier(), {"predict_proba": True}, {'n_neighbors': (5, 7, 9),
'weights': ("uniform", "distance"),
'algorithm': ("ball_tree", "kd_tree"),
'p': (1, 2, 3),
'n_jobs': [-1] }),
"Logistic Regression": (LogisticRegression(), {"predict_proba": True}, {'C': (0.0001, 0.001, 0.01, 0.1, 1, 2, 5, 10),
'solver': ('lbfgs', 'sag', 'saga'),
'max_iter': [400],
'n_jobs': [-1],
'random_state': [1] }),
# SVC is left out for training-time reasons; note it has no 'solver' parameter
# and needs probability=True for predict_proba
#"SVC": (SVC(probability=True), {"predict_proba": True}, {'kernel': ("linear", "poly", "rbf"),
# 'C': (1, 5, 10),
# 'random_state': [1] }),
"LinearSVC": (LinearSVC(), {"predict_proba": False}, {'C': (0.0001, 0.001, 0.01, 0.1, 1, 2, 5, 10),
'max_iter': [2000],
'random_state': [1] }),
"Kmeans": (KMeans(), {"predict_proba": False}, {'n_clusters': [2],
'init': ("k-means++", "random"),
'algorithm': ("lloyd", "elkan"),
'random_state': [1] }),
"MLPClassifier": (MLPClassifier(), {"predict_proba": True}, {'hidden_layer_sizes': ((8, 4), (8, 4, 4)),
'activation': ["relu"],
'learning_rate': ("constant", "adaptive"),
'learning_rate_init': (0.001, 0.005, 0.01, 0.1, 0.15, 0.2),
'max_iter': [400],
'random_state': [1] })
}
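The search strategy in `classification` picks exhaustive `GridSearchCV` only when the full grid has fewer combinations than `n_iter`, and falls back to `RandomizedSearchCV` otherwise. The grid size is simply the product of the per-parameter option counts; for the Decision Tree grid above:

```python
from sklearn.model_selection import ParameterGrid

# Decision Tree grid: 2 criteria * 2 splitters * 1 class_weight * 1 seed = 4 combos
dt_grid = {'criterion': ("gini", "entropy"),
           'splitter': ("best", "random"),
           'class_weight': ["balanced"],
           'random_state': [1]}
print(len(list(ParameterGrid(dt_grid))))  # → 4
```

With `n_iter=10`, this grid (4 < 10) is searched exhaustively, while larger grids such as XGBClassifier's (15 combinations) are sampled randomly.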
result_matrixes = dict()

def classification(classifiers, X_train, y_train, X_test, y_test, n_iter=5):
    result_matrix = pd.DataFrame(columns=["Classifier", "Accuracy", "Accuracy (train)", "Precision", "Precision (train)",
                                          "Recall", "Recall (train)", "F1-Score", "F1-Score (train)", "ROC AUC", "ROC AUC (train)"])
    for name, clf in classifiers.items():
        print("Classifier: ", name)
        # Hyperparameter optimization: exhaustive search for small grids, randomized search otherwise
        if clf[2] is not None:
            if len(list(ParameterGrid(clf[2]))) < n_iter:
                classifier = GridSearchCV(clf[0], clf[2], n_jobs=-1)
            else:
                classifier = RandomizedSearchCV(clf[0], clf[2], n_jobs=-1, n_iter=n_iter, random_state=1)
            classifier.fit(X_train, y_train)
            print("Best hyperparameters : {}".format(classifier.best_params_))
        else:
            classifier = clf[0]
            classifier.fit(X_train, y_train)
        # Prediction task handled by the best estimator found by the hyperparameter cross validator
        y_pred = classifier.predict(X_test)
        y_pred_train = classifier.predict(X_train)
        # Getting predicted probabilities
        if clf[1]["predict_proba"]:
            y_score = classifier.predict_proba(X_test)[:, 1]
            y_score_train = classifier.predict_proba(X_train)[:, 1]
        # Test and training set metrics (the training metrics must be computed
        # on y_train/y_pred_train, not on the test predictions)
        pr, rc, fs, sup = metrics.precision_recall_fscore_support(y_test, y_pred, average='macro')
        pr_train, rc_train, fs_train, sup_train = metrics.precision_recall_fscore_support(y_train, y_pred_train, average='macro')
        result_matrix = pd.concat([result_matrix, pd.DataFrame({"Classifier": name,
                                   "Accuracy": round(metrics.accuracy_score(y_test, y_pred), 4),
                                   "Accuracy (train)": round(metrics.accuracy_score(y_train, y_pred_train), 4),
                                   "Precision": round(pr, 4),
                                   "Precision (train)": round(pr_train, 4),
                                   "Recall": round(rc, 4),
                                   "Recall (train)": round(rc_train, 4),
                                   "F1-Score": round(fs, 4),
                                   "F1-Score (train)": round(fs_train, 4),
                                   "ROC AUC": roc_auc_score(y_test, y_score) if clf[1]["predict_proba"] else None,
                                   "ROC AUC (train)": roc_auc_score(y_train, y_score_train) if clf[1]["predict_proba"] else None}, index=[0])])
        # Confusion matrix for test set
        cf_matrix = confusion_matrix(y_test, y_pred)
        plt.subplot(2, 1, 1)
        cf_plot = sns.heatmap(cf_matrix, annot=True, fmt="d", cmap='Blues')
        cf_plot.set_title("Confusion matrix - test set")
        # Confusion matrix for train set
        plt.subplot(2, 1, 2)
        cf_matrix = confusion_matrix(y_train, y_pred_train)
        cf_plot_train = sns.heatmap(cf_matrix, annot=True, fmt="d", cmap='Blues')
        cf_plot_train.set_title("Confusion matrix - training set")
        plt.show()
        # ROC curves (KMeans exposes no decision scores, so it is skipped)
        if name != "Kmeans":
            RocCurveDisplay.from_estimator(classifier, X_test, y_test)
            print("ROC curve - test set")
            plt.show()
            RocCurveDisplay.from_estimator(classifier, X_train, y_train)
            print("ROC curve - training set")
            plt.show()
    result_matrix.set_index("F1-Score", inplace=True)
    result_matrix.sort_values(by="F1-Score", ascending=False, inplace=True)
    return result_matrix
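Because LinearSVC exposes no `predict_proba`, its ROC AUC is reported as `None` in the result tables. A hedged sketch (synthetic data standing in for the notebook's dataset) of how `decision_function` scores could feed `roc_auc_score` instead:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic binary problem standing in for the real features
X, y = make_classification(n_samples=200, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)
svc = LinearSVC(max_iter=2000, random_state=1).fit(X_tr, y_tr)
# Signed distances to the separating hyperplane rank samples much like probabilities
scores = svc.decision_function(X_te)
auc = roc_auc_score(y_te, scores)
```

ROC AUC only needs a ranking of samples, so any monotone score works in place of calibrated probabilities.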
X_train, X_test, y_train, y_test = tt_split(df)
result_matrixes["Plain dataset"] = classification(classifiers, X_train, y_train, X_test, y_test, n_iter=10)
display(result_matrixes["Plain dataset"])
Classifier: Decision Tree
Best hyperparameters : {'class_weight': 'balanced', 'criterion': 'entropy', 'random_state': 1, 'splitter': 'best'}
ROC curve - test set
ROC curve - training set
Classifier: Random Forest
Best hyperparameters : {'class_weight': 'balanced', 'criterion': 'gini', 'max_features': 'sqrt', 'n_estimators': 100, 'random_state': 1}
ROC curve - test set
ROC curve - training set
Classifier: XGBClassifier
Best hyperparameters : {'tree_method': 'approx', 'random_state': 1, 'n_estimators': 100, 'learning_rate': 0.1}
ROC curve - test set
ROC curve - training set
Classifier: Nearest Neighbors
Best hyperparameters : {'weights': 'uniform', 'p': 1, 'n_neighbors': 9, 'n_jobs': -1, 'algorithm': 'kd_tree'}
ROC curve - test set
ROC curve - training set
Classifier: Logistic Regression
Best hyperparameters : {'solver': 'lbfgs', 'random_state': 1, 'n_jobs': -1, 'max_iter': 400, 'C': 5}
ROC curve - test set
ROC curve - training set
Classifier: LinearSVC
/home/giovo17/.local/lib/python3.10/site-packages/sklearn/svm/_base.py:1225: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations. warnings.warn(
Best hyperparameters : {'C': 10, 'max_iter': 2000, 'random_state': 1}
ROC curve - test set
ROC curve - training set
Classifier: Kmeans
Best hyperparameters : {'algorithm': 'lloyd', 'init': 'random', 'n_clusters': 2, 'random_state': 1}
Classifier: MLPClassifier
Best hyperparameters : {'random_state': 1, 'max_iter': 400, 'learning_rate_init': 0.01, 'learning_rate': 'constant', 'hidden_layer_sizes': (8, 4), 'activation': 'relu'}
ROC curve - test set
ROC curve - training set
| F1-Score | Classifier | Accuracy | Accuracy (train) | Precision | Precision (train) | Recall | Recall (train) | F1-Score (train) | ROC AUC | ROC AUC (train) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.7287 | XGBClassifier | 0.7287 | 0.7483 | 0.7297 | 0.7297 | 0.7296 | 0.7296 | 0.7287 | 0.796294 | 0.824923 |
| 0.7257 | MLPClassifier | 0.7257 | 0.7334 | 0.7269 | 0.7269 | 0.7267 | 0.7267 | 0.7257 | 0.793346 | 0.800074 |
| 0.7193 | Logistic Regression | 0.7193 | 0.7229 | 0.7202 | 0.7202 | 0.7201 | 0.7201 | 0.7193 | 0.782172 | 0.786405 |
| 0.7042 | LinearSVC | 0.7042 | 0.7093 | 0.7047 | 0.7047 | 0.7048 | 0.7048 | 0.7042 | None | None |
| 0.6953 | Random Forest | 0.6955 | 0.9834 | 0.6953 | 0.6953 | 0.6954 | 0.6954 | 0.6953 | 0.747861 | 0.99942 |
| 0.6571 | Nearest Neighbors | 0.6571 | 0.7368 | 0.6574 | 0.6574 | 0.6575 | 0.6575 | 0.6571 | 0.710222 | 0.811528 |
| 0.6257 | Decision Tree | 0.6259 | 0.9834 | 0.6257 | 0.6257 | 0.6258 | 0.6258 | 0.6257 | 0.624746 | 0.999449 |
| 0.5669 | Kmeans | 0.5776 | 0.5796 | 0.5943 | 0.5943 | 0.5832 | 0.5832 | 0.5669 | None | None |
X_train, X_test, y_train, y_test = tt_split(df_cleaned)
result_matrixes["Cleaned dataset"] = classification(classifiers, X_train, y_train, X_test, y_test, n_iter=10)
display(result_matrixes["Cleaned dataset"])
Classifier: Decision Tree
Best hyperparameters : {'class_weight': 'balanced', 'criterion': 'gini', 'random_state': 1, 'splitter': 'best'}
ROC curve - test set
ROC curve - training set
Classifier: Random Forest
Best hyperparameters : {'class_weight': 'balanced', 'criterion': 'entropy', 'max_features': 'sqrt', 'n_estimators': 100, 'random_state': 1}
ROC curve - test set
ROC curve - training set
Classifier: XGBClassifier
Best hyperparameters : {'tree_method': 'exact', 'random_state': 1, 'n_estimators': 100, 'learning_rate': 0.05}
ROC curve - test set
ROC curve - training set
Classifier: Nearest Neighbors
Best hyperparameters : {'weights': 'uniform', 'p': 1, 'n_neighbors': 9, 'n_jobs': -1, 'algorithm': 'kd_tree'}
ROC curve - test set
ROC curve - training set
Classifier: Logistic Regression
Best hyperparameters : {'solver': 'sag', 'random_state': 1, 'n_jobs': -1, 'max_iter': 400, 'C': 0.1}
ROC curve - test set
ROC curve - training set
Classifier: LinearSVC
/home/giovo17/.local/lib/python3.10/site-packages/sklearn/svm/_base.py:1225: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations. warnings.warn(
Best hyperparameters : {'C': 10, 'max_iter': 2000, 'random_state': 1}
ROC curve - test set
ROC curve - training set
Classifier: Kmeans
Best hyperparameters : {'algorithm': 'lloyd', 'init': 'random', 'n_clusters': 2, 'random_state': 1}
Classifier: MLPClassifier
Best hyperparameters : {'random_state': 1, 'max_iter': 400, 'learning_rate_init': 0.001, 'learning_rate': 'adaptive', 'hidden_layer_sizes': (8, 4, 4), 'activation': 'relu'}
ROC curve - test set
ROC curve - training set
| F1-Score | Classifier | Accuracy | Accuracy (train) | Precision | Precision (train) | Recall | Recall (train) | F1-Score (train) | ROC AUC | ROC AUC (train) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.7272 | MLPClassifier | 0.7276 | 0.7332 | 0.7307 | 0.7307 | 0.7287 | 0.7287 | 0.7272 | 0.79295 | 0.800408 |
| 0.7269 | XGBClassifier | 0.7272 | 0.7424 | 0.7297 | 0.7297 | 0.7281 | 0.7281 | 0.7269 | 0.793989 | 0.817191 |
| 0.7194 | Logistic Regression | 0.7196 | 0.7242 | 0.7216 | 0.7216 | 0.7205 | 0.7205 | 0.7194 | 0.781189 | 0.788357 |
| 0.7184 | LinearSVC | 0.7186 | 0.7229 | 0.7208 | 0.7208 | 0.7195 | 0.7195 | 0.7184 | None | None |
| 0.7051 | Nearest Neighbors | 0.7051 | 0.7592 | 0.7055 | 0.7055 | 0.7055 | 0.7055 | 0.7051 | 0.757223 | 0.840482 |
| 0.6894 | Random Forest | 0.6894 | 0.9838 | 0.6895 | 0.6895 | 0.6895 | 0.6895 | 0.6894 | 0.743143 | 0.999439 |
| 0.6422 | Kmeans | 0.6465 | 0.6571 | 0.6578 | 0.6578 | 0.6490 | 0.6490 | 0.6422 | None | None |
| 0.6131 | Decision Tree | 0.6132 | 0.9838 | 0.6131 | 0.6131 | 0.6131 | 0.6131 | 0.6131 | 0.612691 | 0.999474 |
X_train, X_test, y_train, y_test = tt_split(df_cleaned[["age", "gender", "height", "weight", "BMI", "systolic_bp", "diastolic_bp", "cholesterol_1", "cholesterol_2", "cholesterol_3", "glucose_1", "glucose_2", "glucose_3", "physical_activity", "cardio_disease"]])
result_matrixes["Cleaned dataset without smoke and alcohol features"] = classification(classifiers, X_train, y_train, X_test, y_test, n_iter=10)
display(result_matrixes["Cleaned dataset without smoke and alcohol features"])
Classifier: Decision Tree
Best hyperparameters : {'class_weight': 'balanced', 'criterion': 'entropy', 'random_state': 1, 'splitter': 'random'}
ROC curve - test set
ROC curve - training set
Classifier: Random Forest
Best hyperparameters : {'class_weight': 'balanced', 'criterion': 'entropy', 'max_features': 'sqrt', 'n_estimators': 100, 'random_state': 1}
ROC curve - test set
ROC curve - training set
Classifier: XGBClassifier
Best hyperparameters : {'tree_method': 'exact', 'random_state': 1, 'n_estimators': 100, 'learning_rate': 0.05}
ROC curve - test set
ROC curve - training set
Classifier: Nearest Neighbors
Best hyperparameters : {'weights': 'uniform', 'p': 1, 'n_neighbors': 9, 'n_jobs': -1, 'algorithm': 'kd_tree'}
ROC curve - test set
ROC curve - training set
Classifier: Logistic Regression
Best hyperparameters : {'solver': 'sag', 'random_state': 1, 'n_jobs': -1, 'max_iter': 400, 'C': 1}
ROC curve - test set
ROC curve - training set
Classifier: LinearSVC
/home/giovo17/.local/lib/python3.10/site-packages/sklearn/svm/_base.py:1225: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations. warnings.warn(
Best hyperparameters : {'C': 5, 'max_iter': 2000, 'random_state': 1}
ROC curve - test set
ROC curve - training set
Classifier: Kmeans
Best hyperparameters : {'algorithm': 'lloyd', 'init': 'random', 'n_clusters': 2, 'random_state': 1}
Classifier: MLPClassifier
Best hyperparameters : {'random_state': 1, 'max_iter': 400, 'learning_rate_init': 0.001, 'learning_rate': 'adaptive', 'hidden_layer_sizes': (8, 4, 4), 'activation': 'relu'}
ROC curve - test set
ROC curve - training set
| F1-Score | Classifier | Accuracy | Accuracy (train) | Precision | Precision (train) | Recall | Recall (train) | F1-Score (train) | ROC AUC | ROC AUC (train) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.7281 | MLPClassifier | 0.7281 | 0.7325 | 0.7295 | 0.7295 | 0.7288 | 0.7288 | 0.7281 | 0.791201 | 0.798018 |
| 0.7254 | XGBClassifier | 0.7256 | 0.7423 | 0.7280 | 0.7280 | 0.7266 | 0.7266 | 0.7254 | 0.793193 | 0.816049 |
| 0.7191 | Logistic Regression | 0.7193 | 0.7243 | 0.7213 | 0.7213 | 0.7202 | 0.7202 | 0.7191 | 0.780494 | 0.787784 |
| 0.7178 | LinearSVC | 0.7181 | 0.7230 | 0.7203 | 0.7203 | 0.7190 | 0.7190 | 0.7178 | None | None |
| 0.7041 | Nearest Neighbors | 0.7041 | 0.7600 | 0.7046 | 0.7046 | 0.7045 | 0.7045 | 0.7041 | 0.755588 | 0.840698 |
| 0.6860 | Random Forest | 0.6861 | 0.9817 | 0.6861 | 0.6861 | 0.6861 | 0.6861 | 0.6860 | 0.742809 | 0.999219 |
| 0.6422 | Kmeans | 0.6465 | 0.6572 | 0.6579 | 0.6579 | 0.6490 | 0.6490 | 0.6422 | None | None |
| 0.6222 | Decision Tree | 0.6223 | 0.9817 | 0.6223 | 0.6223 | 0.6224 | 0.6224 | 0.6222 | 0.621779 | 0.999332 |
df_obese = df_cleaned[["age", "gender", "height", "weight", "BMI", "systolic_bp", "diastolic_bp", "cholesterol_1", "cholesterol_2", "cholesterol_3", "glucose_1", "glucose_2", "glucose_3", "physical_activity", "cardio_disease"]]
# Keep only overweight/obese subjects (BMI >= 25)
df_obese = df_obese.loc[df_obese["BMI"] >= 25]
X_train, X_test, y_train, y_test = tt_split(df_obese)
result_matrixes["Obese/overweight dataset"] = classification(classifiers, X_train, y_train, X_test, y_test, n_iter=10)
display(result_matrixes["Obese/overweight dataset"])
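The row filter above can be sanity-checked on a toy frame (hypothetical values). One caveat: if the `BMI` column was standardized earlier in the pipeline, the threshold of 25 must be applied to the raw values before scaling, since z-scored BMI values are far from 25.

```python
import pandas as pd

# Toy frame with raw (unscaled) BMI values
demo = pd.DataFrame({"BMI": [22.0, 27.5, 31.2],
                     "cardio_disease": [0, 1, 1]})
overweight = demo.loc[demo["BMI"] >= 25]
print(len(overweight))  # → 2
```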
Classifier: Decision Tree
Best hyperparameters : {'class_weight': 'balanced', 'criterion': 'entropy', 'random_state': 1, 'splitter': 'random'}
ROC curve - test set
ROC curve - training set
Classifier: Random Forest
Best hyperparameters : {'class_weight': 'balanced', 'criterion': 'entropy', 'max_features': 'sqrt', 'n_estimators': 100, 'random_state': 1}
ROC curve - test set
ROC curve - training set
Classifier: XGBClassifier
Best hyperparameters : {'tree_method': 'exact', 'random_state': 1, 'n_estimators': 100, 'learning_rate': 0.05}
ROC curve - test set
ROC curve - training set
Classifier: Nearest Neighbors
Best hyperparameters : {'weights': 'uniform', 'p': 1, 'n_neighbors': 9, 'n_jobs': -1, 'algorithm': 'kd_tree'}
ROC curve - test set
ROC curve - training set
Classifier: Logistic Regression
/home/giovo17/.local/lib/python3.10/site-packages/sklearn/linear_model/_sag.py:350: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge warnings.warn(
Best hyperparameters : {'solver': 'sag', 'random_state': 1, 'n_jobs': -1, 'max_iter': 400, 'C': 1}
ROC curve - test set
ROC curve - training set
Classifier: LinearSVC
/home/giovo17/.local/lib/python3.10/site-packages/sklearn/svm/_base.py:1225: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations. warnings.warn(
Best hyperparameters : {'C': 0.001, 'max_iter': 2000, 'random_state': 1}
ROC curve - test set
ROC curve - training set
Classifier: Kmeans
Best hyperparameters : {'algorithm': 'lloyd', 'init': 'random', 'n_clusters': 2, 'random_state': 1}
Classifier: MLPClassifier
Best hyperparameters : {'random_state': 1, 'max_iter': 400, 'learning_rate_init': 0.001, 'learning_rate': 'adaptive', 'hidden_layer_sizes': (8, 4, 4), 'activation': 'relu'}
ROC curve - test set
ROC curve - training set
| F1-Score | Classifier | Accuracy | Accuracy (train) | Precision | Precision (train) | Recall | Recall (train) | F1-Score (train) | ROC AUC | ROC AUC (train) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.7192 | XGBClassifier | 0.7233 | 0.7422 | 0.7192 | 0.7192 | 0.7191 | 0.7191 | 0.7192 | 0.788441 | 0.813985 |
| 0.7185 | MLPClassifier | 0.7218 | 0.7292 | 0.7181 | 0.7181 | 0.7191 | 0.7191 | 0.7185 | 0.785613 | 0.786888 |
| 0.7174 | Logistic Regression | 0.7222 | 0.7202 | 0.7181 | 0.7181 | 0.7168 | 0.7168 | 0.7174 | 0.776475 | 0.775423 |
| 0.7151 | LinearSVC | 0.7202 | 0.7191 | 0.7162 | 0.7162 | 0.7143 | 0.7143 | 0.7151 | None | None |
| 0.6942 | Nearest Neighbors | 0.6996 | 0.7555 | 0.6950 | 0.6950 | 0.6937 | 0.6937 | 0.6942 | 0.748889 | 0.83151 |
| 0.6819 | Random Forest | 0.6890 | 0.9873 | 0.6842 | 0.6842 | 0.6808 | 0.6808 | 0.6819 | 0.742657 | 0.999535 |
| 0.6155 | Decision Tree | 0.6217 | 0.9873 | 0.6158 | 0.6158 | 0.6154 | 0.6154 | 0.6155 | 0.614553 | 0.999673 |
| 0.3279 | Kmeans | 0.3436 | 0.3447 | 0.3265 | 0.3265 | 0.3301 | 0.3301 | 0.3279 | None | None |
from sklearn.pipeline import Pipeline
def classification_clustering_preprocessing(classifiers, X_train, y_train, X_test, y_test, n_iter=5, n_clusters=10):
result_matrix = pd.DataFrame(columns=["Classifier", "Accuracy", "Accuracy (train)", "Precision", "Precision (train)",
"Recall", "Recall (train)", "F1-Score", "F1-Score (train)", "ROC AUC", "ROC AUC (train)"])
for name, clf in classifiers.items():
print("Classifier: ", name)
# Hyperparameters optimization
        if clf[2] is not None:
if len(list(ParameterGrid(clf[2]))) < n_iter: classifier = GridSearchCV(clf[0], clf[2], n_jobs=-1)
else: classifier = RandomizedSearchCV(clf[0], clf[2], n_jobs=-1, n_iter=n_iter, random_state=1)
pipe = Pipeline([("kmeans", KMeans(n_clusters=n_clusters)),
("classifier", classifier) ])
pipe.fit(X_train, y_train)
print("Best hyperparameters : {}".format(classifier.best_params_))
else:
classifier = clf[0]
            pipe = Pipeline([("kmeans", KMeans(n_clusters=n_clusters)),
                             ("classifier", classifier) ])
            pipe.fit(X_train, y_train)
        # Prediction task handled by the best estimator found by the hyperparameter cross-validation
y_pred = pipe.predict(X_test)
y_pred_train = pipe.predict(X_train)
# Getting predicted probabilities
        if clf[1]["predict_proba"]:
y_score = pipe.predict_proba(X_test)[:,1]
#display("Predicted probability: ", y_score)
y_score_train = pipe.predict_proba(X_train)[:,1]
#display("Predicted probability: ", y_score_train)
# Test set metrics
pr, rc, fs, sup = metrics.precision_recall_fscore_support(y_test, y_pred, average='macro')
        pr_train, rc_train, fs_train, sup_train = metrics.precision_recall_fscore_support(y_train, y_pred_train, average='macro')
result_matrix = pd.concat([result_matrix, pd.DataFrame({"Classifier": name,
"Accuracy": round(metrics.accuracy_score(y_test, y_pred), 4),
"Accuracy (train)": round(metrics.accuracy_score(y_train, y_pred_train), 4),
"Precision": round(pr, 4),
"Precision (train)": round(pr_train, 4),
"Recall": round(rc, 4),
"Recall (train)": round(rc_train, 4),
"F1-Score": round(fs, 4),
"F1-Score (train)": round(fs_train, 4),
"ROC AUC": roc_auc_score(y_test, y_score) if clf[1]["predict_proba"] else None,
"ROC AUC (train)": roc_auc_score(y_train, y_score_train) if clf[1]["predict_proba"] else None }, index=[0])])
# Confusion matrix for test set
cf_matrix = confusion_matrix(y_test, y_pred)
plt.subplot(2, 1, 1)
cf_plot = sns.heatmap(cf_matrix, annot=True, fmt="d", cmap='Blues')
cf_plot.set_title("Confusion matrix - test set")
# Confusion matrix for train set
plt.subplot(2, 1, 2)
cf_matrix = confusion_matrix(y_train, y_pred_train)
cf_plot_train = sns.heatmap(cf_matrix, annot=True, fmt="d", cmap='Blues')
cf_plot_train.set_title("Confusion matrix - training set")
plt.show()
# ROC AUC curve
if name != "Kmeans":
RocCurveDisplay.from_estimator(pipe, X_test, y_test)
print("ROC curve - test set")
plt.show()
RocCurveDisplay.from_estimator(pipe, X_train, y_train)
print("ROC curve - training set")
plt.show()
result_matrix.set_index("F1-Score", inplace=True)
result_matrix.sort_values(by="F1-Score", ascending=False, inplace=True)
return result_matrix
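The search-strategy switch used in the function above (exhaustive `GridSearchCV` when the grid has fewer candidates than the iteration budget, `RandomizedSearchCV` otherwise) can be seen in isolation. A minimal sketch with a hypothetical parameter grid:

```python
from sklearn.model_selection import ParameterGrid

# Same switch as in the function above: exhaustive search when the grid
# is small, randomized sampling when it exceeds the iteration budget.
param_grid = {"C": [0.001, 0.01, 0.1, 1], "max_iter": [1000, 2000]}
n_iter = 10

n_candidates = len(list(ParameterGrid(param_grid)))  # 4 * 2 = 8 combinations
use_grid_search = n_candidates < n_iter
print(n_candidates, use_grid_search)  # 8 True
```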
X_train, X_test, y_train, y_test = tt_split(df_cleaned)
result_matrixes["Cleaned dataset - clustering preprocessing"] = classification_clustering_preprocessing(classifiers, X_train, y_train, X_test, y_test, n_iter=10, n_clusters=10)
display(result_matrixes["Cleaned dataset - clustering preprocessing"])
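A detail worth making explicit: when `KMeans` sits as an intermediate `Pipeline` step, its `transform` method replaces the original features with the distances to each cluster centre, so the downstream classifier is trained on `n_clusters` columns rather than the raw features. A minimal sketch on synthetic data (variable names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic two-class data: 100 samples, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# KMeans as a transformer: transform() yields distances to the cluster
# centres, so the classifier sees n_clusters features instead of 5.
pipe = Pipeline([("kmeans", KMeans(n_clusters=10, n_init=10, random_state=1)),
                 ("classifier", LogisticRegression(max_iter=1000))])
pipe.fit(X, y)

# The classifier was trained on the 10 cluster-distance features
print(pipe.named_steps["kmeans"].transform(X).shape)  # (100, 10)
```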
Classifier: Decision Tree
Best hyperparameters : {'class_weight': 'balanced', 'criterion': 'gini', 'random_state': 1, 'splitter': 'best'}
ROC curve - test set
ROC curve - training set
Classifier: Random Forest
Best hyperparameters : {'class_weight': 'balanced', 'criterion': 'entropy', 'max_features': 'sqrt', 'n_estimators': 100, 'random_state': 1}
ROC curve - test set
ROC curve - training set
Classifier: XGBClassifier
Best hyperparameters : {'tree_method': 'exact', 'random_state': 1, 'n_estimators': 100, 'learning_rate': 0.05}
ROC curve - test set
ROC curve - training set
Classifier: Nearest Neighbors
Best hyperparameters : {'weights': 'uniform', 'p': 1, 'n_neighbors': 9, 'n_jobs': -1, 'algorithm': 'kd_tree'}
ROC curve - test set
ROC curve - training set
Classifier: Logistic Regression
Best hyperparameters : {'solver': 'sag', 'random_state': 1, 'n_jobs': -1, 'max_iter': 400, 'C': 0.1}
ROC curve - test set
ROC curve - training set
Classifier: LinearSVC
/home/giovo17/.local/lib/python3.10/site-packages/sklearn/svm/_base.py:1225: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations. warnings.warn( [warning repeated for each cross-validation fit]
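The liblinear solver behind `LinearSVC` often fails to converge on unscaled features; standardizing the inputs, or raising `max_iter` well beyond 2000, usually silences these warnings. A hedged sketch of both remedies on synthetic data (values are illustrative, not tuned for this dataset):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) * 100.0  # deliberately unscaled features
y = (X[:, 0] > 0).astype(int)

# Remedy 1: scale features so liblinear converges in fewer iterations.
# Remedy 2: raise max_iter as a fallback for hard cases.
svc = Pipeline([("scaler", StandardScaler()),
                ("svc", LinearSVC(C=0.01, max_iter=10000, random_state=1))])
svc.fit(X, y)
print(svc.score(X, y))
```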
Best hyperparameters : {'C': 0.01, 'max_iter': 2000, 'random_state': 1}
ROC curve - test set
ROC curve - training set
Classifier: Kmeans
Best hyperparameters : {'algorithm': 'lloyd', 'init': 'random', 'n_clusters': 2, 'random_state': 1}
Classifier: MLPClassifier
Best hyperparameters : {'random_state': 1, 'max_iter': 400, 'learning_rate_init': 0.005, 'learning_rate': 'constant', 'hidden_layer_sizes': (8, 4, 4), 'activation': 'relu'}
ROC curve - test set
ROC curve - training set
| F1-Score | Classifier | Accuracy | Accuracy (train) | Precision | Precision (train) | Recall | Recall (train) | F1-Score (train) | ROC AUC | ROC AUC (train) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.7120 | MLPClassifier | 0.7120 | 0.7178 | 0.7130 | 0.7130 | 0.7126 | 0.7126 | 0.7120 | 0.772879 | 0.782469 |
| 0.7091 | XGBClassifier | 0.7092 | 0.7329 | 0.7104 | 0.7104 | 0.7099 | 0.7099 | 0.7091 | 0.773 | 0.808088 |
| 0.7067 | Logistic Regression | 0.7068 | 0.7149 | 0.7083 | 0.7083 | 0.7076 | 0.7076 | 0.7067 | 0.768493 | 0.777507 |
| 0.7058 | LinearSVC | 0.7060 | 0.7141 | 0.7077 | 0.7077 | 0.7068 | 0.7068 | 0.7058 | None | None |
| 0.6887 | Nearest Neighbors | 0.6887 | 0.7501 | 0.6889 | 0.6889 | 0.6889 | 0.6889 | 0.6887 | 0.740162 | 0.829688 |
| 0.6817 | Random Forest | 0.6817 | 0.9837 | 0.6818 | 0.6818 | 0.6819 | 0.6819 | 0.6817 | 0.731347 | 0.999447 |
| 0.5981 | Decision Tree | 0.5981 | 0.9838 | 0.5981 | 0.5981 | 0.5981 | 0.5981 | 0.5981 | 0.597407 | 0.999474 |
| 0.3614 | Kmeans | 0.4080 | 0.3944 | 0.3667 | 0.3667 | 0.4026 | 0.4026 | 0.3614 | None | None |
for data, result in result_matrixes.items():
display(data, result)
'Plain dataset'
| F1-Score | Classifier | Accuracy | Accuracy (train) | Precision | Precision (train) | Recall | Recall (train) | F1-Score (train) | ROC AUC | ROC AUC (train) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.7287 | XGBClassifier | 0.7287 | 0.7483 | 0.7297 | 0.7297 | 0.7296 | 0.7296 | 0.7287 | 0.796294 | 0.824923 |
| 0.7257 | MLPClassifier | 0.7257 | 0.7334 | 0.7269 | 0.7269 | 0.7267 | 0.7267 | 0.7257 | 0.793346 | 0.800074 |
| 0.7193 | Logistic Regression | 0.7193 | 0.7229 | 0.7202 | 0.7202 | 0.7201 | 0.7201 | 0.7193 | 0.782172 | 0.786405 |
| 0.7042 | LinearSVC | 0.7042 | 0.7093 | 0.7047 | 0.7047 | 0.7048 | 0.7048 | 0.7042 | None | None |
| 0.6953 | Random Forest | 0.6955 | 0.9834 | 0.6953 | 0.6953 | 0.6954 | 0.6954 | 0.6953 | 0.747861 | 0.99942 |
| 0.6571 | Nearest Neighbors | 0.6571 | 0.7368 | 0.6574 | 0.6574 | 0.6575 | 0.6575 | 0.6571 | 0.710222 | 0.811528 |
| 0.6257 | Decision Tree | 0.6259 | 0.9834 | 0.6257 | 0.6257 | 0.6258 | 0.6258 | 0.6257 | 0.624746 | 0.999449 |
| 0.5669 | Kmeans | 0.5776 | 0.5796 | 0.5943 | 0.5943 | 0.5832 | 0.5832 | 0.5669 | None | None |
'Cleaned dataset'
| F1-Score | Classifier | Accuracy | Accuracy (train) | Precision | Precision (train) | Recall | Recall (train) | F1-Score (train) | ROC AUC | ROC AUC (train) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.7272 | MLPClassifier | 0.7276 | 0.7332 | 0.7307 | 0.7307 | 0.7287 | 0.7287 | 0.7272 | 0.79295 | 0.800408 |
| 0.7269 | XGBClassifier | 0.7272 | 0.7424 | 0.7297 | 0.7297 | 0.7281 | 0.7281 | 0.7269 | 0.793989 | 0.817191 |
| 0.7194 | Logistic Regression | 0.7196 | 0.7242 | 0.7216 | 0.7216 | 0.7205 | 0.7205 | 0.7194 | 0.781189 | 0.788357 |
| 0.7184 | LinearSVC | 0.7186 | 0.7229 | 0.7208 | 0.7208 | 0.7195 | 0.7195 | 0.7184 | None | None |
| 0.7051 | Nearest Neighbors | 0.7051 | 0.7592 | 0.7055 | 0.7055 | 0.7055 | 0.7055 | 0.7051 | 0.757223 | 0.840482 |
| 0.6894 | Random Forest | 0.6894 | 0.9838 | 0.6895 | 0.6895 | 0.6895 | 0.6895 | 0.6894 | 0.743143 | 0.999439 |
| 0.6422 | Kmeans | 0.6465 | 0.6571 | 0.6578 | 0.6578 | 0.6490 | 0.6490 | 0.6422 | None | None |
| 0.6131 | Decision Tree | 0.6132 | 0.9838 | 0.6131 | 0.6131 | 0.6131 | 0.6131 | 0.6131 | 0.612691 | 0.999474 |
'Cleaned dataset without smoke and alcool features'
| F1-Score | Classifier | Accuracy | Accuracy (train) | Precision | Precision (train) | Recall | Recall (train) | F1-Score (train) | ROC AUC | ROC AUC (train) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.7281 | MLPClassifier | 0.7281 | 0.7325 | 0.7295 | 0.7295 | 0.7288 | 0.7288 | 0.7281 | 0.791201 | 0.798018 |
| 0.7254 | XGBClassifier | 0.7256 | 0.7423 | 0.7280 | 0.7280 | 0.7266 | 0.7266 | 0.7254 | 0.793193 | 0.816049 |
| 0.7191 | Logistic Regression | 0.7193 | 0.7243 | 0.7213 | 0.7213 | 0.7202 | 0.7202 | 0.7191 | 0.780494 | 0.787784 |
| 0.7178 | LinearSVC | 0.7181 | 0.7230 | 0.7203 | 0.7203 | 0.7190 | 0.7190 | 0.7178 | None | None |
| 0.7041 | Nearest Neighbors | 0.7041 | 0.7600 | 0.7046 | 0.7046 | 0.7045 | 0.7045 | 0.7041 | 0.755588 | 0.840698 |
| 0.6860 | Random Forest | 0.6861 | 0.9817 | 0.6861 | 0.6861 | 0.6861 | 0.6861 | 0.6860 | 0.742809 | 0.999219 |
| 0.6422 | Kmeans | 0.6465 | 0.6572 | 0.6579 | 0.6579 | 0.6490 | 0.6490 | 0.6422 | None | None |
| 0.6222 | Decision Tree | 0.6223 | 0.9817 | 0.6223 | 0.6223 | 0.6224 | 0.6224 | 0.6222 | 0.621779 | 0.999332 |
'Obese/overweight dataset'
| F1-Score | Classifier | Accuracy | Accuracy (train) | Precision | Precision (train) | Recall | Recall (train) | F1-Score (train) | ROC AUC | ROC AUC (train) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.7192 | XGBClassifier | 0.7233 | 0.7422 | 0.7192 | 0.7192 | 0.7191 | 0.7191 | 0.7192 | 0.788441 | 0.813985 |
| 0.7185 | MLPClassifier | 0.7218 | 0.7292 | 0.7181 | 0.7181 | 0.7191 | 0.7191 | 0.7185 | 0.785613 | 0.786888 |
| 0.7174 | Logistic Regression | 0.7222 | 0.7202 | 0.7181 | 0.7181 | 0.7168 | 0.7168 | 0.7174 | 0.776475 | 0.775423 |
| 0.7151 | LinearSVC | 0.7202 | 0.7191 | 0.7162 | 0.7162 | 0.7143 | 0.7143 | 0.7151 | None | None |
| 0.6942 | Nearest Neighbors | 0.6996 | 0.7555 | 0.6950 | 0.6950 | 0.6937 | 0.6937 | 0.6942 | 0.748889 | 0.83151 |
| 0.6819 | Random Forest | 0.6890 | 0.9873 | 0.6842 | 0.6842 | 0.6808 | 0.6808 | 0.6819 | 0.742657 | 0.999535 |
| 0.6155 | Decision Tree | 0.6217 | 0.9873 | 0.6158 | 0.6158 | 0.6154 | 0.6154 | 0.6155 | 0.614553 | 0.999673 |
| 0.3279 | Kmeans | 0.3436 | 0.3447 | 0.3265 | 0.3265 | 0.3301 | 0.3301 | 0.3279 | None | None |
'Cleaned dataset - clustering preprocessing'
| F1-Score | Classifier | Accuracy | Accuracy (train) | Precision | Precision (train) | Recall | Recall (train) | F1-Score (train) | ROC AUC | ROC AUC (train) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.7120 | MLPClassifier | 0.7120 | 0.7178 | 0.7130 | 0.7130 | 0.7126 | 0.7126 | 0.7120 | 0.772879 | 0.782469 |
| 0.7091 | XGBClassifier | 0.7092 | 0.7329 | 0.7104 | 0.7104 | 0.7099 | 0.7099 | 0.7091 | 0.773 | 0.808088 |
| 0.7067 | Logistic Regression | 0.7068 | 0.7149 | 0.7083 | 0.7083 | 0.7076 | 0.7076 | 0.7067 | 0.768493 | 0.777507 |
| 0.7058 | LinearSVC | 0.7060 | 0.7141 | 0.7077 | 0.7077 | 0.7068 | 0.7068 | 0.7058 | None | None |
| 0.6887 | Nearest Neighbors | 0.6887 | 0.7501 | 0.6889 | 0.6889 | 0.6889 | 0.6889 | 0.6887 | 0.740162 | 0.829688 |
| 0.6817 | Random Forest | 0.6817 | 0.9837 | 0.6818 | 0.6818 | 0.6819 | 0.6819 | 0.6817 | 0.731347 | 0.999447 |
| 0.5981 | Decision Tree | 0.5981 | 0.9838 | 0.5981 | 0.5981 | 0.5981 | 0.5981 | 0.5981 | 0.597407 | 0.999474 |
| 0.3614 | Kmeans | 0.4080 | 0.3944 | 0.3667 | 0.3667 | 0.4026 | 0.4026 | 0.3614 | None | None |
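With all the result matrices collected, the best classifier per dataset variant can be pulled out programmatically rather than read off by eye. A minimal sketch, assuming each matrix is indexed by F1-Score and sorted descending as above (toy stand-in data, illustrative names):

```python
import pandas as pd

# Toy stand-ins for the real result matrices (index = F1-Score, sorted desc)
result_matrixes = {
    "Plain dataset": pd.DataFrame(
        {"Classifier": ["XGBClassifier", "MLPClassifier"]},
        index=pd.Index([0.7287, 0.7257], name="F1-Score")),
    "Cleaned dataset": pd.DataFrame(
        {"Classifier": ["MLPClassifier", "XGBClassifier"]},
        index=pd.Index([0.7272, 0.7269], name="F1-Score")),
}

# Top row of each matrix = best classifier for that dataset variant
summary = pd.DataFrame(
    {name: {"Best classifier": m.iloc[0]["Classifier"],
            "Best F1-Score": m.index[0]}
     for name, m in result_matrixes.items()}).T
print(summary)
```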